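Contents
1. Crawling data on Korean major leaguers from Namuwiki
2. Crawling the KBO website by searching for player names
3. Crawling the top 10 hitters of 2020 (by batting average) from the KBO website
The figure above shows the Namuwiki page that comes up when you search Google for '한국인 메이저리거' (Korean major leaguers).
Let's pull the names of the Korean major leaguers inside the red box on that page.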
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
# widen the column display width so long URLs are not truncated
pd.set_option('display.max_colwidth', 1000)
We import the requests module, which fetches a page from a URL; the BeautifulSoup module, which makes the fetched HTML easy to parse; and the pandas module, which is handy for editing data.
Let's crawl the page with a GET request.
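The page URL used in the request is simply the percent-encoded form of the search phrase. As a small sketch (the variable names `title` and `namu_url` are illustrative and not part of the original notebook), the standard library can build the URL and decode it again:

```python
from urllib.parse import quote, unquote

# The article title on Namuwiki ("Korean major leaguers").
title = '한국인 메이저리거'

# Percent-encode the title as UTF-8 and append it to the wiki path prefix.
namu_url = 'https://namu.wiki/w/' + quote(title)
print(namu_url)

# Decoding the last path segment recovers the readable title.
print(unquote(namu_url.rsplit('/', 1)[-1]))
```

Building the URL this way keeps the readable title in the code and avoids stray whitespace inside a pasted URL string.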
# get korean majorleaguer data from namuwiki
url = "https://namu.wiki/w/%ED%95%9C%EA%B5%AD%EC%9D%B8%20%EB%A9%94%EC%9D%B4%EC%A0%80%EB%A6%AC%EA%B1%B0"
req = requests.get(url)
html = req.text
soup = bs(html, 'html.parser')
korean_majorleaguer = []
for td in soup('td'):
    for a in td('a'):
        # keep internal wiki links whose text is exactly three characters long
        # (the usual length of a Korean name) and that are not collected yet
        if ('wiki-link-internal' in a.get('class', [])
                and len(a.get_text()) == 3
                and a.get_text() not in korean_majorleaguer):
            korean_majorleaguer.append(a.get_text())
korean_majorleaguer = sorted(korean_majorleaguer)
player_position = ['내야수', '투수', '투수', '투수', '투수', '내야수', '외야수', '투수',
'투수', '내야수', '투수', '투수', '투수', '투수', '투수', '내야수',
'투수', '투수', '투수', '내야수', '내야수', '외야수', '내야수']
player_birth_year = ['1987', '1969', '1988', '1979', '1977', '1995', '1988', '1983', '1987', '1986', '1973', '1980',
'1980', '1977', '1982', '1982', '1971', '1976', '1975', '1991', '1979', '1982', '1987']
korean_majorleaguer_df = pd.DataFrame(data={'Name': korean_majorleaguer, 'Birth_year': player_birth_year,
'Position': player_position})
korean_majorleaguer_df
| | Name | Birth_year | Position |
|---|---|---|---|
| 0 | 강정호 | 1987 | 내야수 |
| 1 | 구대성 | 1969 | 투수 |
| 2 | 김광현 | 1988 | 투수 |
| 3 | 김병현 | 1979 | 투수 |
| 4 | 김선우 | 1977 | 투수 |
| 5 | 김하성 | 1995 | 내야수 |
| 6 | 김현수 | 1988 | 외야수 |
| 7 | 류제국 | 1983 | 투수 |
| 8 | 류현진 | 1987 | 투수 |
| 9 | 박병호 | 1986 | 내야수 |
| 10 | 박찬호 | 1973 | 투수 |
| 11 | 백차승 | 1980 | 투수 |
| 12 | 봉중근 | 1980 | 투수 |
| 13 | 서재응 | 1977 | 투수 |
| 14 | 오승환 | 1982 | 투수 |
| 15 | 이대호 | 1982 | 내야수 |
| 16 | 이상훈 | 1971 | 투수 |
| 17 | 임창용 | 1976 | 투수 |
| 18 | 조진호 | 1975 | 투수 |
| 19 | 최지만 | 1991 | 내야수 |
| 20 | 최희섭 | 1979 | 내야수 |
| 21 | 추신수 | 1982 | 외야수 |
| 22 | 황재균 | 1987 | 내야수 |
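The names crawled from Namuwiki, combined with the hand-entered birth years and positions, give the table above. (In the Position column, 내야수 is infielder, 외야수 is outfielder, and 투수 is pitcher.)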
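Because the birth years and positions are typed by hand in the sorted order of the crawled names, a length mismatch is an easy mistake to make. The sketch below checks the alignment before any table is built. The helper name `build_player_records` is hypothetical, and the sample input is just the first three rows of the table above.

```python
def build_player_records(names, birth_years, positions):
    """Pair each name with its hand-entered birth year and position."""
    if not (len(names) == len(birth_years) == len(positions)):
        raise ValueError(
            f'length mismatch: {len(names)} names, '
            f'{len(birth_years)} birth years, {len(positions)} positions'
        )
    return [
        {'Name': name, 'Birth_year': year, 'Position': position}
        for name, year, position in zip(names, birth_years, positions)
    ]


# First three rows of the table above, used as sample input.
records = build_player_records(
    ['강정호', '구대성', '김광현'],
    ['1987', '1969', '1988'],
    ['내야수', '투수', '투수'],
)
print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`.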
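The figure above shows the KBO website.
Using selenium, let's look up each Korean major leaguer through '선수조회' (player search), marked with the red box in the figure, and find the page that holds their career record.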
import time
from selenium import webdriver

# written against the Selenium 3 API; Selenium 4 replaces find_element_by_*
# with find_element(By.ID, ...) and find_element(By.XPATH, ...)
player_info_url = []
# execute chrome web browser
path = './chromedriver'
for player in korean_majorleaguer_df.loc[:]['Name'].to_list():
    driver = webdriver.Chrome(path)
    driver.get('https://www.koreabaseball.com/Player/Search.aspx')
    # 1. search player, 2. click the search button, 3. count the players who share the name
    element = driver.find_element_by_id("cphContents_cphContents_cphContents_txtSearchPlayerName")
    element.send_keys(player)
    driver.find_element_by_xpath('//*[@id="cphContents_cphContents_cphContents_btnSearch"]').click()
    time.sleep(2)
    count_search_result = driver.find_element_by_xpath('//*[@id="cphContents_cphContents_cphContents_udpRecord"]/div[2]/p/strong/span')
    # branch on the number of search results
    if count_search_result.text == '0':
        player_info_url.append('NO_INFO_URL')
    elif count_search_result.text == '1':
        driver.find_element_by_xpath('//*[@id="cphContents_cphContents_cphContents_udpRecord"]/div[2]/table/tbody/tr/td[2]/a').click()
        player_info_url.append(driver.current_url)
    else:
        temp_ls = []
        for count in range(int(count_search_result.text)):
            driver.find_element_by_xpath('//*[@id="cphContents_cphContents_cphContents_udpRecord"]/div[2]/table/tbody/tr[{0}]/td[2]/a'.format(count+1)).click()
            temp_ls.append(driver.current_url)
            time.sleep(2)
            driver.back()
            time.sleep(2)
            # search and click again to get back to the result list
            element = driver.find_element_by_id("cphContents_cphContents_cphContents_txtSearchPlayerName")
            element.send_keys(player)
            driver.find_element_by_xpath('//*[@id="cphContents_cphContents_cphContents_btnSearch"]').click()
            time.sleep(2)
        player_info_url.append(temp_ls)
    # quit() closes the window and also ends the chromedriver process
    driver.quit()
korean_majorleaguer_df['info_url'] = player_info_url
korean_majorleaguer_df
| | Name | Birth_year | Position | info_url |
|---|---|---|---|---|
| 0 | 강정호 | 1987 | 내야수 | https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=76325 |
| 1 | 구대성 | 1969 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=93715 |
| 2 | 김광현 | 1988 | 투수 | [https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=94233, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=77829] |
| 3 | 김병현 | 1979 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=62349 |
| 4 | 김선우 | 1977 | 투수 | [https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=93362, https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=68416, https://www.koreabaseball.com/Futures/Player/HitterDetail.aspx?playerId=51604, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=78232] |
| 5 | 김하성 | 1995 | 내야수 | https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=64300 |
| 6 | 김현수 | 1988 | 외야수 | [https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=72442, https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76290, https://www.koreabaseball.com/Record/Player/PitcherDetail/Basic.aspx?playerId=69516] |
| 7 | 류제국 | 1983 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=63111 |
| 8 | 류현진 | 1987 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=76715 |
| 9 | 박병호 | 1986 | 내야수 | [https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=75125, https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=89630] |
| 10 | 박찬호 | 1973 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=50112, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=50112, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=50112] |
| 11 | 백차승 | 1980 | 투수 | NO_INFO_URL |
| 12 | 봉중근 | 1980 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=77147 |
| 13 | 서재응 | 1977 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=78640 |
| 14 | 오승환 | 1982 | 투수 | https://www.koreabaseball.com/Record/Player/PitcherDetail/Basic.aspx?playerId=75421 |
| 15 | 이대호 | 1982 | 내야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=71564 |
| 16 | 이상훈 | 1971 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=90851, https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=99156, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=93147, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=93147, https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=60724, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=80091] |
| 17 | 임창용 | 1976 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=95657 |
| 18 | 조진호 | 1975 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=73830 |
| 19 | 최지만 | 1991 | 내야수 | NO_INFO_URL |
| 20 | 최희섭 | 1979 | 내야수 | https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=77623 |
| 21 | 추신수 | 1982 | 외야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=51817 |
| 22 | 황재균 | 1987 | 내야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76313 |
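The resulting data contains players with several URLs and players with no URL at all.
Several URLs appear when other players share the same name, and a URL is missing when the player never played in the KBO and so has no career record page.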
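The info_url column above mixes three shapes: the marker 'NO_INFO_URL', a single string, and a list that may repeat the same URL (as in 박찬호's row). A minimal sketch of bringing them to one shape, assuming a hypothetical helper named `normalize_urls`:

```python
def normalize_urls(value):
    """Return the cell value as a list of unique URLs, in original order."""
    if value == 'NO_INFO_URL':
        return []
    if isinstance(value, str):
        return [value]
    # dict.fromkeys keeps the first occurrence of each URL and drops repeats
    return list(dict.fromkeys(value))


park_url = 'https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=50112'
print(normalize_urls('NO_INFO_URL'))
print(normalize_urls(park_url))
print(normalize_urls([park_url, park_url, park_url]))
```

With every cell as a list, later steps can test for an empty list instead of comparing against a marker string.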
Crawling with selenium worked but took far too long, so let's crawl again using a POST request.
This time, let's also use each player's Birth_year to pick out only the Korean major leaguers' URLs, without duplicates.
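The POST request that follows hardcodes __VIEWSTATE and __EVENTVALIDATION values copied from one browser session, and these can stop working when the page changes. ASP.NET Web Forms pages normally carry such values as hidden input fields, so one option is to read them from the search page first. The sketch below does this on a made-up HTML snippet using only the standard library. The class name `HiddenFieldParser` and the sample values are hypothetical, and whether the KBO site accepts freshly read tokens in this way is an assumption that has not been verified here.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr = dict(attrs)
        if (attr.get('type') or '').lower() == 'hidden' and attr.get('name'):
            self.fields[attr['name']] = attr.get('value') or ''


# Made-up sample; a real page would carry much longer token values.
sample_html = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="playerName" value="" />
</form>
'''

parser = HiddenFieldParser()
parser.feed(sample_html)
print(parser.fields)
```

The resulting dictionary could then be merged into the form data before the player name and search button fields are added.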
import re
base_url = 'https://www.koreabaseball.com'
player_info_url = []
for idx, player_name in enumerate(korean_majorleaguer_df.loc[:]['Name'].to_list()):
    url_ls = []
    url = 'https://www.koreabaseball.com/Player/Search.aspx'
    # _headers identifies the client (browser or program) making the request
    _headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.192 Safari/537.36'
    }
    # form fields copied from the search request that the browser sends
    _data = {
'ctl00$ctl00$ctl00$cphContents$cphContents$ScriptManager1': 'ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$udpRecord|ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$btnSearch',
'__VIEWSTATE': '/wEPDwUJMTMxNzk5NDM2D2QWAmYPZBYCZg9kFgJmD2QWAgIBD2QWAmYPZBYCAgEPZBYCAgUPZBYCAgEPZBYCZg9kFghmDxAPFgYeDURhdGFUZXh0RmllbGQFCEZJUlNUX05NHg5EYXRhVmFsdWVGaWVsZAUEVF9JRB4LXyFEYXRhQm91bmRnZBAVCwrtjIAg7ISg7YOdAk5DBuuRkOyCsAJLVAJMRwbtgqTsm4ADS0lBBuuhr+uNsAbsgrzshLEDU1NHBu2VnO2ZlBULAAJOQwJPQgJLVAJMRwJXTwJIVAJMVAJTUwJTSwJISBQrAwtnZ2dnZ2dnZ2dnZxYBZmQCAQ8QZGQWAWZkAgIPD2QWAh4Kb25rZXlwcmVzcwV5aWYoZXZlbnQua2V5Q29kZSA9PSAxMyl7X19kb1Bvc3RCYWNrKCdjdGwwMCRjdGwwMCRjdGwwMCRjcGhDb250ZW50cyRjcGhDb250ZW50cyRjcGhDb250ZW50cyRidG5TZWFyY2gnLCcnKTtyZXR1cm4gZmFsc2U7fWQCBQ8PFgYeCVBhZ2VJbmRleAUBMR4IUGFnZVNpemUFAjIwHg1Ub3RhbFJvd0NvdW50ZmQWHAIBDw8WAh4HVmlzaWJsZWhkZAIDDw8WAh8HaGRkAgUPDxYGHgRUZXh0BQExHghDc3NDbGFzcwUCb24eBF8hU0ICAmRkAgcPDxYIHwgFATIfCWUfCgICHwdoZGQCCQ8PFggfCAUBMx8JZR8KAgIfB2hkZAILDw8WCB8IBQE0HwllHwoCAh8HaGRkAg0PDxYIHwgFATUfCWUfCgICHwdoZGQCDw8PFgYfCWUfCgICHwdoZGQCEQ8PFgYfCWUfCgICHwdoZGQCEw8PFgYfCWUfCgICHwdoZGQCFQ8PFgYfCWUfCgICHwdoZGQCFw8PFgYfCWUfCgICHwdoZGQCGQ8PFgIfB2hkZAIbDw8WAh8HaGRkZBPJzJ33mOjJ7KF9HH4CC8Tqp/b7mZ0aZXc4BJ9coh4S',
'__VIEWSTATEGENERATOR': '6942A5F7',
'__EVENTVALIDATION': '/wEdABX7pAfSpLXhapqlKWQnTuGdqUK6+mRsdxhnNCA+WG212kNgeBSbNKLKcKTg82HGZ5MBG/Pf3I75C3rKX8xVtEwo+kQ8AqPLE4RrhO6ZFCqeBK7PzKYNHJn1ix41G9IhOQuJXHqRUXRUZklBZmuFWpsNdCA+bt9lrvQ7Bnt3sOnVYKuMptg8kI77SbU4++55lNksVPJEqMmnsPvHB9dIJGpf5hXVFrJQTuW7aWcjKnYReCWPyJkVhvXuOBYs5qqOoClVZUXxd/uiJhLNEa+9XxmnYsasFt+7FfwAs6g8a36bapjfFDeu2NDc13pCkdcMGuGM1LdB1BdIShiN9H1skvV4F1JgdjTRh5oeAN6Yg5M8ELpmrE3ZItDca36f0/kynHQIcBFwjnAKA7Co0KA6ls0JuuwNcLXb9Hf3j9q4Bnx0xSIEsp38v5tQJGTakYgKU9HPyZ6Du6F4V7ocMxWqriJyG65+WeXn6FgnALIFFcA4KQ==',
'ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$hfPage': 1,
'ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$txtSearchPlayerName': player_name,
'__ASYNCPOST': 'true',
'ctl00$ctl00$ctl00$cphContents$cphContents$cphContents$btnSearch': '검색'
}
    html = requests.post(url, data=_data, headers=_headers)
    soup = bs(html.text, 'html.parser')
    extra_url = ''
    for td in soup.find_all('td'):
        # remember the most recent player link seen in the result table
        if td.a is not None:
            extra_url = td.a.get('href')
        # get url information if birth year is matched
        if korean_majorleaguer_df.loc[idx]['Birth_year'] in td.text:
            url_ls.append(base_url + extra_url)
    player_info_url.append(url_ls)
korean_majorleaguer_df['info_url'] = player_info_url
korean_majorleaguer_df
| | Name | Birth_year | Position | info_url |
|---|---|---|---|---|
| 0 | 강정호 | 1987 | 내야수 | [https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=76325] |
| 1 | 구대성 | 1969 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=93715] |
| 2 | 김광현 | 1988 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=77829] |
| 3 | 김병현 | 1979 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=62349] |
| 4 | 김선우 | 1977 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=78232] |
| 5 | 김하성 | 1995 | 내야수 | [https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=64300] |
| 6 | 김현수 | 1988 | 외야수 | [https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76290] |
| 7 | 류제국 | 1983 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=63111] |
| 8 | 류현진 | 1987 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=76715] |
| 9 | 박병호 | 1986 | 내야수 | [https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=75125] |
| 10 | 박찬호 | 1973 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=62761] |
| 11 | 백차승 | 1980 | 투수 | [] |
| 12 | 봉중근 | 1980 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=77147] |
| 13 | 서재응 | 1977 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=78640] |
| 14 | 오승환 | 1982 | 투수 | [https://www.koreabaseball.com/Record/Player/PitcherDetail/Basic.aspx?playerId=75421] |
| 15 | 이대호 | 1982 | 내야수 | [https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=71564] |
| 16 | 이상훈 | 1971 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=93147, https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=90851] |
| 17 | 임창용 | 1976 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=95657] |
| 18 | 조진호 | 1975 | 투수 | [https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=73830] |
| 19 | 최지만 | 1991 | 내야수 | [] |
| 20 | 최희섭 | 1979 | 내야수 | [https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=77623] |
| 21 | 추신수 | 1982 | 외야수 | [https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=51817] |
| 22 | 황재균 | 1987 | 내야수 | [https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76313] |
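To handle namesakes, we matched on both the player's name and birth year, yet 이상훈 alone still came back with two URLs.
Checking by hand showed that the second URL belongs to the 이상훈 who played in the major leagues, so let's drop the first one.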
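The matching rule itself can be sketched as a small function: keep every search result whose birth date starts with the expected year. Two players who share both a name and a birth year will both pass, which is the situation seen with 이상훈. The helper name `urls_matching_birth_year`, the player IDs, and the birth dates below are all made up for illustration.

```python
def urls_matching_birth_year(rows, birth_year):
    """rows is a list of (href, birth_date_text) pairs from one search result."""
    return [href for href, birth_date in rows if birth_date.startswith(birth_year)]


# Made-up search result for one name: two players born in 1971, one in 1985.
sample_rows = [
    ('/Record/Retire/Pitcher.aspx?playerId=1', '1971-01-01'),
    ('/Record/Retire/Pitcher.aspx?playerId=2', '1971-12-31'),
    ('/Record/Retire/Hitter.aspx?playerId=3', '1985-05-05'),
]

matches = urls_matching_birth_year(sample_rows, '1971')
print(matches)
```

Comparing only against the birth date cell, rather than every cell in the row, also avoids accidental matches on other numbers that happen to contain the year.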
# keep only the last URL for 이상훈; .at writes to the DataFrame itself,
# whereas chained indexing would assign to a temporary copy
lee_idx = korean_majorleaguer_df.index[korean_majorleaguer_df['Name'] == '이상훈'][0]
korean_majorleaguer_df.at[lee_idx, 'info_url'] = korean_majorleaguer_df.at[lee_idx, 'info_url'][-1:]
# unwrap each list into a single URL string ('' when there is none)
for index in korean_majorleaguer_df.index:
    urls = korean_majorleaguer_df.at[index, 'info_url']
    korean_majorleaguer_df.at[index, 'info_url'] = urls[0] if urls else ''
korean_majorleaguer_df
| | Name | Birth_year | Position | info_url |
|---|---|---|---|---|
| 0 | 강정호 | 1987 | 내야수 | https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=76325 |
| 1 | 구대성 | 1969 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=93715 |
| 2 | 김광현 | 1988 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=77829 |
| 3 | 김병현 | 1979 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=62349 |
| 4 | 김선우 | 1977 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=78232 |
| 5 | 김하성 | 1995 | 내야수 | https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=64300 |
| 6 | 김현수 | 1988 | 외야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76290 |
| 7 | 류제국 | 1983 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=63111 |
| 8 | 류현진 | 1987 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=76715 |
| 9 | 박병호 | 1986 | 내야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=75125 |
| 10 | 박찬호 | 1973 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=62761 |
| 11 | 백차승 | 1980 | 투수 | |
| 12 | 봉중근 | 1980 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=77147 |
| 13 | 서재응 | 1977 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=78640 |
| 14 | 오승환 | 1982 | 투수 | https://www.koreabaseball.com/Record/Player/PitcherDetail/Basic.aspx?playerId=75421 |
| 15 | 이대호 | 1982 | 내야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=71564 |
| 16 | 이상훈 | 1971 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=90851 |
| 17 | 임창용 | 1976 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=95657 |
| 18 | 조진호 | 1975 | 투수 | https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=73830 |
| 19 | 최지만 | 1991 | 내야수 | |
| 20 | 최희섭 | 1979 | 내야수 | https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=77623 |
| 21 | 추신수 | 1982 | 외야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=51817 |
| 22 | 황재균 | 1987 | 내야수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76313 |
With 이상훈's URL corrected, the final URL table for the Korean major leaguers is complete.
Based on these URLs, let's fetch each Korean major leaguer's career record from their KBO years.
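The URLs in the table come in two layouts: pages under 'Retire' for players no longer active in the KBO, and 'HitterDetail' or 'PitcherDetail' pages whose 'Basic.aspx' has a 'Total.aspx' counterpart with the career record. A minimal sketch of classifying a URL before fetching it, using a hypothetical helper named `classify_player_url`:

```python
def classify_player_url(url):
    """Return (record_url, is_retired, is_hitter) for a KBO player URL."""
    is_retired = '/Retire/' in url
    is_hitter = 'Hitter' in url
    # Active players: switch from the basic page to the career-total page.
    record_url = url if is_retired else url.replace('/Basic.aspx', '/Total.aspx')
    return record_url, is_retired, is_hitter


retired_pitcher = 'https://www.koreabaseball.com/Record/Retire/Pitcher.aspx?playerId=76715'
active_hitter = 'https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76290'

print(classify_player_url(retired_pitcher))
print(classify_player_url(active_hitter))
```

Keeping this decision in one place makes it easier to choose the right table class for each page type afterwards.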
korean_majorleaguer_dict = {}
for idx, url in enumerate(korean_majorleaguer_df['info_url']):
    player_name = korean_majorleaguer_df.loc[idx, 'Name']
    # to distinguish whether the player is retired or not
    check_retired = True
    # if there's an url
    if url:
        if ('HitterDetail' in url) or ('PitcherDetail' in url):
            # active players: switch from the basic page to the career-total page
            url = url.replace('Basic', 'Total')
            check_retired = False
        req = requests.get(url)
        html = req.text
        soup = bs(html, 'html.parser')
        player_column_list = ['연도', '팀명']  # year, team
        # the record table uses a different CSS class on each page type
        if check_retired:
            table_class = 'tData01 tt mb5'
        elif 'Hitter' in url:
            table_class = 'tbl tt mb5'
        else:
            table_class = 'tbl tt mgb5'
        player_temp_table = soup.find('table', {'class': table_class})
        for col in player_temp_table.find_all('th'):
            for a in col.find_all('a'):
                player_column_list.append(a.get('title'))
        temp_data = pd.DataFrame(columns=player_column_list)
        i = 0
        index = 0
        col_len = len(player_column_list)
        # fill one row per season; pandas raises ValueError once the cells run out
        while True:
            try:
                temp_data.loc[i] = [x.text for x in player_temp_table.find_all('td')[index : index + col_len]]
                i += 1
                index += col_len
            except ValueError:
                break
        # gather only career data
        # 추신수's page has no records, so reset his url to ''
        try:
            player_temp_table = soup.find('tfoot', {'class': 'play_record'})
            career = [x.text for x in player_temp_table.find_all('th')]
            career.insert(0, '통산')  # career total
            temp_data.loc[i] = career
            korean_majorleaguer_dict[player_name] = temp_data
        except (AttributeError, ValueError):
            korean_majorleaguer_df.at[idx, 'info_url'] = ''
            korean_majorleaguer_dict[player_name] = ''
    # if there's no valid url
    else:
        korean_majorleaguer_dict[player_name] = ''
korean_majorleaguer_dict['강정호']
| | 연도 | 팀명 | 타율 | 경기 | 타수 | 득점 | 안타 | 2루타 | 3루타 | 홈런 | 루타 | 타점 | 도루 | 도루실패 | 볼넷 | 사구 | 삼진 | 병살타 | 실책 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006 | 현대 | 0.150 | 10 | 20 | 1 | 3 | 1 | 0 | 0 | 4 | 1 | 0 | 1 | 0 | 0 | 8 | 1 | 3 |
| 1 | 2007 | 현대 | 0.133 | 20 | 15 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 0 |
| 2 | 2008 | 우리 | 0.271 | 116 | 362 | 36 | 98 | 18 | 1 | 8 | 142 | 47 | 3 | 1 | 31 | 5 | 65 | 12 | 13 |
| 3 | 2009 | 히어로즈 | 0.286 | 133 | 476 | 73 | 136 | 33 | 2 | 23 | 242 | 81 | 3 | 2 | 45 | 4 | 81 | 18 | 15 |
| 4 | 2010 | 넥센 | 0.301 | 133 | 449 | 60 | 135 | 30 | 2 | 12 | 205 | 58 | 2 | 2 | 61 | 7 | 87 | 14 | 23 |
| 5 | 2011 | 넥센 | 0.282 | 123 | 444 | 53 | 125 | 22 | 2 | 9 | 178 | 63 | 4 | 6 | 43 | 9 | 62 | 12 | 13 |
| 6 | 2012 | 넥센 | 0.314 | 124 | 436 | 77 | 137 | 32 | 0 | 25 | 244 | 82 | 21 | 5 | 71 | 6 | 78 | 16 | 12 |
| 7 | 2013 | 넥센 | 0.291 | 126 | 450 | 67 | 131 | 21 | 1 | 22 | 220 | 96 | 15 | 8 | 68 | 6 | 109 | 18 | 15 |
| 8 | 2014 | 넥센 | 0.356 | 117 | 418 | 103 | 149 | 36 | 2 | 40 | 309 | 117 | 3 | 3 | 68 | 13 | 106 | 8 | 9 |
| 9 | 통산 | 통산 | 0.298 | 902 | 3070 | 470 | 916 | 193 | 10 | 139 | 1546 | 545 | 51 | 28 | 387 | 50 | 601 | 100 | 103 |
As in the code above, looking up a Korean major leaguer's name in korean_majorleaguer_dict returns that player's career record.
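The row-building loop in the code above stops by waiting for an exception once the table cells run out. The same grouping can be written as plain slicing, which is easier to check in isolation. This is a sketch with a hypothetical helper named `chunk_rows`; the sample cells are the first four columns of 강정호's first two seasons, plus one leftover cell to show that incomplete rows are dropped.

```python
def chunk_rows(cells, n_cols):
    """Split a flat list of cell texts into complete rows of n_cols cells."""
    if n_cols <= 0:
        raise ValueError('n_cols must be positive')
    usable = len(cells) - len(cells) % n_cols  # ignore a trailing partial row
    return [cells[start:start + n_cols] for start in range(0, usable, n_cols)]


cells = ['2006', '현대', '0.150', '10',
         '2007', '현대', '0.133', '20',
         'leftover']
rows = chunk_rows(cells, 4)
print(rows)
```

A list of rows like this can be passed to `pd.DataFrame(rows, columns=...)` in one call instead of appending row by row.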
Next, we want the data for the 10 hitters with the best batting averages, shown inside the red box on the records page (기록실) of the KBO website.
base_url = 'https://www.koreabaseball.com/Record/Player/HitterBasic/Basic1.aspx'
req = requests.get(base_url)
html = req.text
soup = bs(html, 'html.parser')
top10_hitter_url_df = pd.DataFrame(columns=['Name', 'info_url'])
count = 0
top10_hitter_temp_table = soup.find('table', {'class': 'tData01 tt'})
for url in top10_hitter_temp_table.find_all('a'):
    # links without an href are treated as empty strings
    temp_tag = url.get('href') or ''
    if 'playerId' in temp_tag and count < 10:
        top10_hitter_url_df.loc[count] = [url.text, 'https://www.koreabaseball.com' + temp_tag]
        count += 1
top10_hitter_url_df
| | Name | info_url |
|---|---|---|
| 0 | 최형우 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=72443 |
| 1 | 손아섭 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=77532 |
| 2 | 로하스 | https://www.koreabaseball.com/Record/Retire/Hitter.aspx?playerId=67025 |
| 3 | 박민우 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=62907 |
| 4 | 페르난데스 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=69209 |
| 5 | 이정후 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=67341 |
| 6 | 허경민 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=79240 |
| 7 | 김현수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76290 |
| 8 | 강백호 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=68050 |
| 9 | 양의지 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76232 |
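This gives the URLs of the top 10 hitters by batting average in 2020. Rojas (로하스) left the KBO after the 2020 season, so the site keeps his record on a retired-player ('Retire') page whose columns differ from those of the other players. Let's exclude him and include the hitter with the next-highest batting average instead.
We also use these URLs to fetch each player's career record.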
top10_hitter_dict = {}
for idx, url in enumerate(top10_hitter_url_df.loc[:]['info_url']):
    if 'Basic' in url:
        # active players: switch from the basic page to the career-total page
        url = url.replace('Basic', 'Total')
        req = requests.get(url)
        html = req.text
        soup = bs(html, 'html.parser')
        top10_hitter_table = soup.find('table', {'class': 'tbl tt mb5'})
        top10_hitter_column_list = ['연도', '팀명']  # year, team
        for col in top10_hitter_table.find_all('th'):
            for a in col.find_all('a'):
                top10_hitter_column_list.append(a.get('title'))
        temp_data = pd.DataFrame(columns=top10_hitter_column_list)
        i = 0
        index = 0
        # fill one row per season; pandas raises ValueError once the cells run out
        while True:
            try:
                temp_data.loc[i] = [x.text for x in top10_hitter_table.find_all('td')[index : index + len(top10_hitter_column_list)]]
                i += 1
                index += len(top10_hitter_column_list)
            except ValueError:
                break
        top10_hitter_table = soup.find('tfoot', {'class': 'play_record'})
        career = [x.text for x in top10_hitter_table.find_all('th')]
        career.insert(0, '통산')  # career total
        temp_data.loc[i] = career
        top10_hitter_dict[top10_hitter_url_df.loc[idx]['Name']] = temp_data
    # exclude retired players because their pages use different columns
    else:
        top10_hitter_url_df.drop([idx], inplace=True)
top10_hitter_url_df = top10_hitter_url_df.reset_index(drop=True)
top10_hitter_url_df
| | Name | info_url |
|---|---|---|
| 0 | 최형우 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=72443 |
| 1 | 손아섭 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=77532 |
| 2 | 박민우 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=62907 |
| 3 | 페르난데스 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=69209 |
| 4 | 이정후 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=67341 |
| 5 | 허경민 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=79240 |
| 6 | 김현수 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76290 |
| 7 | 강백호 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=68050 |
| 8 | 양의지 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=76232 |
| 9 | 나성범 | https://www.koreabaseball.com/Record/Player/HitterDetail/Basic.aspx?playerId=62947 |
Rojas, who is listed as retired, was excluded, and 나성범, the hitter with the next-best batting average, was added to the data.
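The code above only drops the retired-page entry; the step that brings in the next hitter is not shown. One possible way to get the same result, sketched here under the assumption that more than ten ranked rows are available, is to walk the ranking in order and keep the first n players whose URL is not a 'Retire' page. The helper name `pick_active` is hypothetical, and the sample uses four rows from the table above.

```python
def pick_active(ranked_players, n):
    """Keep the first n (name, url) pairs that are not retired-player pages."""
    picked = []
    for name, url in ranked_players:
        if '/Retire/' in url:
            continue  # retired pages use a different column layout
        picked.append((name, url))
        if len(picked) == n:
            break
    return picked


base = 'https://www.koreabaseball.com/Record'
ranked = [
    ('최형우', base + '/Player/HitterDetail/Basic.aspx?playerId=72443'),
    ('손아섭', base + '/Player/HitterDetail/Basic.aspx?playerId=77532'),
    ('로하스', base + '/Retire/Hitter.aspx?playerId=67025'),
    ('박민우', base + '/Player/HitterDetail/Basic.aspx?playerId=62907'),
]

top3 = pick_active(ranked, 3)
print([name for name, _ in top3])
```

Filtering before the career records are fetched also avoids dropping rows from the DataFrame while looping over it.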
top10_hitter_dict['최형우']
| | 연도 | 팀명 | 타율 | 경기 | 타석 | 타수 | 득점 | 안타 | 2루타 | 3루타 | ... | 타점 | 도루 | 도루실패 | 볼넷 | 사구 | 삼진 | 병살타 | 장타율 | 출루율 | 실책 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002 | 삼성 | 0.400 | 4 | 6 | 5 | 0 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.800 | 0.400 | 0 |
| 1 | 2004 | 삼성 | 0.000 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.000 | 0.000 | 0 |
| 2 | 2008 | 삼성 | 0.276 | 126 | 440 | 384 | 68 | 106 | 24 | 0 | ... | 71 | 3 | 0 | 48 | 6 | 76 | 9 | 0.487 | 0.364 | 3 |
| 3 | 2009 | 삼성 | 0.284 | 113 | 481 | 415 | 70 | 118 | 24 | 0 | ... | 83 | 1 | 2 | 53 | 9 | 60 | 12 | 0.508 | 0.375 | 3 |
| 4 | 2010 | 삼성 | 0.279 | 121 | 506 | 420 | 71 | 117 | 29 | 1 | ... | 97 | 4 | 0 | 71 | 8 | 91 | 13 | 0.524 | 0.389 | 2 |
| 5 | 2011 | 삼성 | 0.340 | 133 | 571 | 480 | 80 | 163 | 37 | 3 | ... | 118 | 4 | 3 | 76 | 5 | 88 | 8 | 0.617 | 0.427 | 1 |
| 6 | 2012 | 삼성 | 0.271 | 125 | 531 | 461 | 51 | 125 | 27 | 1 | ... | 77 | 2 | 1 | 55 | 5 | 76 | 20 | 0.425 | 0.348 | 3 |
| 7 | 2013 | 삼성 | 0.305 | 128 | 573 | 511 | 80 | 156 | 28 | 0 | ... | 98 | 2 | 1 | 47 | 7 | 91 | 5 | 0.530 | 0.366 | 1 |
| 8 | 2014 | 삼성 | 0.356 | 113 | 493 | 430 | 92 | 153 | 33 | 0 | ... | 100 | 4 | 2 | 50 | 7 | 62 | 11 | 0.649 | 0.426 | 1 |
| 9 | 2015 | 삼성 | 0.318 | 144 | 637 | 547 | 94 | 174 | 33 | 1 | ... | 123 | 2 | 5 | 73 | 9 | 101 | 13 | 0.563 | 0.402 | 2 |
| 10 | 2016 | 삼성 | 0.376 | 138 | 618 | 519 | 99 | 195 | 46 | 2 | ... | 144 | 2 | 2 | 83 | 9 | 83 | 12 | 0.651 | 0.464 | 3 |
| 11 | 2017 | KIA | 0.342 | 142 | 629 | 514 | 98 | 176 | 36 | 3 | ... | 120 | 0 | 1 | 96 | 11 | 82 | 15 | 0.576 | 0.450 | 4 |
| 12 | 2018 | KIA | 0.339 | 143 | 609 | 528 | 92 | 179 | 34 | 1 | ... | 103 | 3 | 0 | 66 | 7 | 87 | 17 | 0.549 | 0.414 | 4 |
| 13 | 2019 | KIA | 0.300 | 136 | 555 | 456 | 65 | 137 | 31 | 1 | ... | 86 | 0 | 1 | 85 | 7 | 77 | 13 | 0.485 | 0.413 | 0 |
| 14 | 2020 | KIA | 0.354 | 140 | 600 | 522 | 93 | 185 | 37 | 1 | ... | 115 | 0 | 0 | 70 | 5 | 101 | 9 | 0.590 | 0.433 | 0 |
| 15 | 통산 | 통산 | 0.321 | 1708 | 7251 | 6194 | 1053 | 1986 | 421 | 14 | ... | 1335 | 27 | 18 | 873 | 95 | 1076 | 157 | 0.553 | 0.408 | 27 |
16 rows × 22 columns
As in the code above, looking up one of the top 10 players' names in top10_hitter_dict returns that player's career record.
import pickle
with open('top10_hitter_dict.pickle', 'wb') as file:
    pickle.dump(top10_hitter_dict, file)
with open('korean_majorleaguer_dict.pickle', 'wb') as file:
    pickle.dump(korean_majorleaguer_dict, file)
with open('korean_majorleaguer_df.pickle', 'wb') as file:
    pickle.dump(korean_majorleaguer_df, file)
with open('top10_hitter_url_df.pickle', 'wb') as file:
    pickle.dump(top10_hitter_url_df, file)
Now let's save all of the data with the code above.
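Reading the files back in a later session mirrors the save step: open in 'rb' mode and call pickle.load. The round trip below is a self-contained sketch that uses a temporary directory and a small made-up dictionary (`sample`) in place of the real objects.

```python
import os
import pickle
import tempfile

# Small stand-in for one of the saved dictionaries.
sample = {'강정호': [['2006', '현대', '0.150']]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_dict.pickle')
    # save
    with open(path, 'wb') as file:
        pickle.dump(sample, file)
    # load
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == sample)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.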
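We have now crawled the career records of the Korean major leaguers and of the top 10 KBO hitters by batting average in 2020.
Next, let's use this data to predict which KBO hitter will move on to the major leagues.
On to the next data story!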